Syllabus

Week Topic
1
  • Syllabus review.

  • Installation of R and RStudio. Manage packages.

  • Introduction to GitHub.

  • Set up a project.

2
  • R Basics
3
  • Summarizing data

  • Generate publication-ready tables using `tableby`

4

Plotting with R, basics

  • Anatomy of `ggplot`

  • Scatter plots

  • Line plots

  • Bar plots

  • Histograms

  • Multiple geoms, multiple aes()

5

Plotting with R, tuning plots

  • Scales

  • Colors

  • Titles and labels

  • Themes

  • Save your plot

6

Plotting phylogenetic trees with R

  • Basics of phylogenetic trees

  • Packages to manage trees in R

  • Loading phylogenetic trees in R

  • Plot trees using ggtree

7

Plotting phylogenetic trees with R

  • Link trees with data

  • Plot trees with data

  • Visual exploration of phylogenetic trees

8 Data analysis - Continuous data
9 Data analysis - Linear regression
10 Data analysis - Categorical data
11 Data analysis - Logistic regression
12 Data analysis - Time-to-event data and survival
13 Visualize RNA-seq data, part 1
14 Visualize RNA-seq data, part 2

What is unique in this class?

  • Real biomedical data and cases as examples
  • The latest packages and features
  • Publication-ready graphics and tables

Examples

1. Summarize data and make tables using a simple package

Load required libraries

library(tidyverse)
library(table1)
library(knitr)
library(arsenal)
library(patchwork)
library(GGally)

Load data from a CSV file.

dfall <- read_csv('data/data.csv')

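dfall <- dfall %>% 
  mutate(
    # Convert the TCS, PI and DIST20 columns to numeric
    # (across() replaces the deprecated mutate_at()/funs() idiom)
    across(c("TCS_PR", "TCS_RT", "TCS_IN", "TCS_V1V3", "PI_RT", "PI_V1V3", "DIST20_RT", "DIST20_V1V3"), as.numeric),
    # Set factor levels explicitly so tables and plots list them in this order
    racecat = factor(racecat, levels = c("White", "Black", "Hispanic", "Other/Unkn")),
    risk2 = factor(risk2, levels = c("MSM", "HET-F", "HET-M", "PWID-F", "PWID-M", "OTHER/UNKN"))
  )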

labels(dfall) <- c(ngscollectyr = 'Year of Diagnosis', 
                   gender = 'Gender',
                   racecat = 'Race',
                   age_cat30 = 'Age ≤ 30y/o',
                   risk2 = 'Risk factor',
                   recent_cat = 'Recency Category',
                   owning_jd_region_fsu = 'Region in NC',
                   incluster = 'In Clusters',
                   cd4_value = 'CD4 count (cells/µL)',
                   vl_log_value = 'Viral Load (Log10 copies/mL)'
)
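
The type conversion and factor re-leveling above can be tried on a small table before running it on the real file. The sketch below uses a made-up three-row data frame (`toy` and its values are hypothetical, only the column names are borrowed from the example) and the `across()` idiom from current dplyr. It shows two things worth knowing: text that cannot be parsed as a number becomes `NA`, and the order given in `levels` decides the order in which categories later appear in tables and plots.

```r
library(dplyr)

# Hypothetical toy data standing in for data/data.csv
toy <- tibble(
  TCS_PR  = c("120", "35", NA),
  TCS_RT  = c("88", "n/a", "410"),
  racecat = c("Black", "White", "Hispanic")
)

toy <- toy %>%
  mutate(
    # Unparseable text such as "n/a" becomes NA; the coercion warning is silenced here
    across(c("TCS_PR", "TCS_RT"), ~ suppressWarnings(as.numeric(.x))),
    # Levels are listed in display order, including one that is absent from the toy data
    racecat = factor(racecat, levels = c("White", "Black", "Hispanic", "Other/Unkn"))
  )

str(toy)
levels(toy$racecat)
```

Checking `str()` after this step is a quick way to confirm that every column has the type the later summary and plotting code expects.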

Summarize the data, perform statistical comparisons, and generate a publication-ready table.

Table 1. Characteristics of sequenced participants with new diagnoses in NC from 2018 to 2021.

dfall %>% 
  tableby(ngscollectyr ~ recent_cat + gender + racecat + age_cat30 + risk2 +
            owning_jd_region_fsu + incluster + cd4_value + vl_log_value, 
          data = ., cat.simplify = FALSE, numeric.stats = c("median", "q1q3"), test = TRUE) %>% 
  summary(., digits = 1, digits.count = 0, digits.pct = 1, digits.p = 2, title = NULL)

| | 2018 (N=270) | 2019 (N=237) | 2020 (N=112) | 2021 (N=195) | Total (N=814) | p value |
|---|---|---|---|---|---|---|
| Recency Category | | | | | | 0.02 |
| Chronic | 131 (48.5%) | 109 (46.0%) | 43 (38.4%) | 97 (49.7%) | 380 (46.7%) | |
| Indeterminant | 44 (16.3%) | 21 (8.9%) | 16 (14.3%) | 31 (15.9%) | 112 (13.8%) | |
| Recent | 95 (35.2%) | 107 (45.1%) | 53 (47.3%) | 67 (34.4%) | 322 (39.6%) | |
| Gender | | | | | | 0.08 |
| Female | 44 (16.3%) | 29 (12.2%) | 7 (6.2%) | 29 (14.9%) | 109 (13.4%) | |
| Male | 222 (82.2%) | 202 (85.2%) | 100 (89.3%) | 163 (83.6%) | 687 (84.4%) | |
| Transgender Female | 4 (1.5%) | 6 (2.5%) | 4 (3.6%) | 3 (1.5%) | 17 (2.1%) | |
| Transgender Male | 0 (0.0%) | 0 (0.0%) | 1 (0.9%) | 0 (0.0%) | 1 (0.1%) | |
| Race | | | | | | 0.04 |
| White | 46 (17.0%) | 38 (16.0%) | 26 (23.2%) | 29 (14.9%) | 139 (17.1%) | |
| Black | 191 (70.7%) | 147 (62.0%) | 70 (62.5%) | 126 (64.6%) | 534 (65.6%) | |
| Hispanic | 26 (9.6%) | 30 (12.7%) | 10 (8.9%) | 28 (14.4%) | 94 (11.5%) | |
| Other/Unkn | 7 (2.6%) | 22 (9.3%) | 6 (5.4%) | 12 (6.2%) | 47 (5.8%) | |
| Age ≤ 30y/o | | | | | | < 0.01 |
| No | 109 (40.4%) | 76 (32.1%) | 33 (29.5%) | 90 (46.2%) | 308 (37.8%) | |
| Yes | 161 (59.6%) | 161 (67.9%) | 79 (70.5%) | 105 (53.8%) | 506 (62.2%) | |
| Risk factor | | | | | | < 0.01 |
| MSM | 183 (67.8%) | 169 (71.3%) | 82 (73.2%) | 127 (65.1%) | 561 (68.9%) | |
| HET-F | 39 (14.4%) | 25 (10.5%) | 5 (4.5%) | 6 (3.1%) | 75 (9.2%) | |
| HET-M | 38 (14.1%) | 28 (11.8%) | 6 (5.4%) | 9 (4.6%) | 81 (10.0%) | |
| PWID-F | 5 (1.9%) | 0 (0.0%) | 0 (0.0%) | 1 (0.5%) | 6 (0.7%) | |
| PWID-M | 5 (1.9%) | 9 (3.8%) | 8 (7.1%) | 5 (2.6%) | 27 (3.3%) | |
| OTHER/UNKN | 0 (0.0%) | 6 (2.5%) | 11 (9.8%) | 47 (24.1%) | 64 (7.9%) | |
| Region in NC | | | | | | < 0.01 |
| Asheville | 21 (7.8%) | 22 (9.3%) | 8 (7.1%) | 12 (6.2%) | 63 (7.7%) | |
| Charlotte | 48 (17.8%) | 12 (5.1%) | 5 (4.5%) | 11 (5.6%) | 76 (9.3%) | |
| Fayetteville | 26 (9.6%) | 32 (13.5%) | 17 (15.2%) | 19 (9.7%) | 94 (11.5%) | |
| Greensboro | 66 (24.4%) | 70 (29.5%) | 30 (26.8%) | 46 (23.6%) | 212 (26.0%) | |
| Raleigh | 54 (20.0%) | 57 (24.1%) | 19 (17.0%) | 52 (26.7%) | 182 (22.4%) | |
| Wilmington | 9 (3.3%) | 8 (3.4%) | 12 (10.7%) | 16 (8.2%) | 45 (5.5%) | |
| Winterville | 46 (17.0%) | 36 (15.2%) | 21 (18.8%) | 39 (20.0%) | 142 (17.4%) | |
| In Clusters | | | | | | 0.08 |
| No | 89 (33.0%) | 73 (30.8%) | 26 (23.2%) | 73 (37.4%) | 261 (32.1%) | |
| Yes | 181 (67.0%) | 164 (69.2%) | 86 (76.8%) | 122 (62.6%) | 553 (67.9%) | |
| CD4 count (cells/µL) | | | | | | 0.55 |
| Median | 407.0 | 432.0 | 428.0 | 394.0 | 419.0 | |
| Q1, Q3 | 275.0, 567.2 | 299.0, 604.5 | 281.0, 571.0 | 280.0, 532.0 | 288.0, 580.0 | |
| Viral Load (Log10 copies/mL) | | | | | | < 0.01 |
| Median | 4.7 | 4.5 | 5.1 | 4.7 | 4.7 | |
| Q1, Q3 | 4.1, 5.2 | 4.0, 5.1 | 4.7, 5.5 | 4.1, 5.2 | 4.1, 5.2 | |
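
The p value column depends on which test is run for each row. Assuming arsenal's documented defaults (a chi-squared test for categorical variables and a one-way ANOVA for numeric ones, unless `numeric.test` or `cat.test` is changed), the base R sketch below reproduces the same kinds of tests by hand. The data frame `toy` is simulated and hypothetical, so its p values have nothing to do with Table 1.

```r
# Hypothetical toy data: diagnosis year, one categorical and one numeric variable
set.seed(1)
toy <- data.frame(
  year   = factor(rep(c("2018", "2019", "2020"), each = 40)),
  recent = sample(c("Chronic", "Recent"), 120, replace = TRUE),
  vl_log = rnorm(120, mean = 4.7, sd = 0.8)
)

# Categorical row: Pearson chi-squared test on the category-by-year counts
p_cat <- chisq.test(table(toy$recent, toy$year))$p.value

# Numeric row: one-way ANOVA across years (the assumed default numeric test)
p_anova <- summary(aov(vl_log ~ year, data = toy))[[1]][["Pr(>F)"]][1]

# Rank-based alternative that pairs more naturally with a median and Q1, Q3 summary
p_kw <- kruskal.test(vl_log ~ year, data = toy)$p.value

round(c(chisq = p_cat, anova = p_anova, kruskal = p_kw), 2)
```

When a table reports medians and quartiles, it is worth stating which test produced the p value; in arsenal the Kruskal-Wallis test can be requested with `numeric.test = "kwt"`.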

2. Plot TCS numbers in different regions by recency category.

df_tcs <- dfall %>% select(c(
                             "recent_cat",
                             "TCS_RT",
                             "TCS_PR",
                             "TCS_IN",
                             "TCS_V1V3"
                             )
)

# Violin plot with jittered points for one TCS column, on a log10 y axis.
# `col` is the column name as a string; `title` is the y-axis title.
tcs_chart <- function(col, title) {
  df_tcs %>% 
    ggplot(aes(x = recent_cat, y = .data[[col]])) + 
    geom_violin() + 
    geom_jitter(aes(colour = recent_cat), size = 1, alpha = 0.5) + 
    scale_y_continuous(name = title, trans = "log10") +
    labs(colour = "Recency Category") + 
    theme_bw() + 
    # The colour legend identifies the categories, so hide the x-axis title, text and ticks
    theme(axis.title.x = element_blank(),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank())
}

p1 <- tcs_chart("TCS_PR", "TCS# PR")
p2 <- tcs_chart("TCS_IN", "TCS# IN")
p3 <- tcs_chart("TCS_RT", "TCS# RT")
p4 <- tcs_chart("TCS_V1V3", "TCS# V1V3")

(p1 | p2) /
  (p3 | p4)
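
An alternative to building four plots and combining them with patchwork is to reshape the data to long format and let `facet_wrap()` create the panels. The sketch below is one possible way to do that, not the approach used above; `toy_tcs` is simulated, hypothetical data that only borrows the column names of `df_tcs`.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data with the same column names as df_tcs
set.seed(1)
toy_tcs <- tibble(
  recent_cat = sample(c("Chronic", "Indeterminant", "Recent"), 90, replace = TRUE),
  TCS_PR     = round(10^runif(90, 1, 4)),
  TCS_IN     = round(10^runif(90, 1, 4)),
  TCS_RT     = round(10^runif(90, 1, 4)),
  TCS_V1V3   = round(10^runif(90, 1, 4))
)

# One row per sample and region instead of one column per region
tcs_long <- toy_tcs %>%
  pivot_longer(starts_with("TCS_"),
               names_to = "region", names_prefix = "TCS_",
               values_to = "tcs")

p <- ggplot(tcs_long, aes(x = recent_cat, y = tcs)) +
  geom_violin() +
  geom_jitter(aes(colour = recent_cat), size = 1, alpha = 0.5) +
  scale_y_log10(name = "TCS#") +
  facet_wrap(~ region, scales = "free_y") +
  labs(colour = "Recency Category") +
  theme_bw() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

p
```

Faceting keeps a single shared legend and one set of styling code, while patchwork gives finer control when the panels need different titles or layouts.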

3. Make beautiful phylogenetic trees.

[Figure: Example tree 1]

[Figure: Example tree 2]
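
As a starting point for trees like these, the sketch below loads a tree from a Newick string with ape and draws it with ggtree, the package named in the syllabus. The four-taxon tree and its branch lengths are made up for illustration, and ggtree is assumed to be installed from Bioconductor.

```r
library(ape)
library(ggtree)  # installed from Bioconductor, not CRAN

# Hypothetical four-taxon tree in Newick format (branch lengths are invented)
newick <- "((A:0.1,B:0.2):0.3,(C:0.4,D:0.1):0.2);"
tree <- read.tree(text = newick)

# Draw the tree, label the tips, and add a scale bar for branch lengths
p <- ggtree(tree) +
  geom_tiplab() +
  geom_treescale()

p
```

Because `ggtree()` returns a ggplot object, the layers, themes, and saving functions covered in the plotting weeks apply to trees as well.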